Abstract. A data structure, called a biased range tree, is presented that preprocesses a set S of n points
in $\mathbb{R}^2$ and a query distribution D for 2-sided orthogonal range counting queries. The expected query time
for this data structure, when queries are drawn according to D, matches, to within a constant factor,
that of the optimal decision tree for S and D. The memory and preprocessing requirements of the data
structure are $O(n \log n)$.

1 Introduction

Let S be a set of n points in $\mathbb{R}^2$ and let D be a probability measure over $\mathbb{R}^2$. A 2-sided orthogonal range
counting query over S asks, for a query point $q = (q_x, q_y)$, to report the number of points $(p_x, p_y) \in S$ such
that $p_x \ge q_x$ and $p_y \ge q_y$. A 2-sided range counting query has distribution D if the query point q is chosen
from the probability measure D. If T is a data structure for answering 2-sided range counting queries
over S then we denote by $\mu_D(T)$ the expected time, using T, to answer a range query with distribution
D. The current paper is concerned with preprocessing the pair (S, D) to build a data structure T that
minimizes $\mu_D(T)$.
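
As a point of reference, the query itself can be stated as a brute-force scan. The following is only a minimal sketch of the problem definition, not of the data structure developed in this paper; the function name `range_count` is hypothetical, and the inequalities are taken to be non-strict, which is an assumption consistent with the later statement that catalogue points have coordinates greater than or equal to those of q.

```python
def range_count(points, q):
    """Count the points (px, py) with px >= qx and py >= qy by scanning all of S.

    This takes time linear in |S| per query; it only illustrates what a
    2-sided orthogonal range counting query asks for.
    """
    qx, qy = q
    return sum(1 for (px, py) in points if px >= qx and py >= qy)


S = [(1, 1), (2, 3), (4, 2), (5, 5)]
print(range_count(S, (2, 2)))
```

Any correct structure for the problem must agree with this scan on every query point, which makes it a convenient oracle when testing an implementation.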

1.1 Previous Work

The general topic of geometric range queries is a field that has seen an enormous amount of activity in
the last century. Results in this field depend heavily on the types of objects the data structure stores and
on the shape of the query ranges. In this section we only mention a few data structures for orthogonal
range counting and semigroup queries in 2 dimensions. The interested reader is directed to the excellent,
and easily accessible, survey by Agarwal and Erickson [1].

Orthogonal range counting is a classic problem in computational geometry. The 2- (and 3- and
4-) sided range counting problem can be solved by Bentley's range trees [3]. Range trees use $O(n \log n)$
space and can be constructed in $O(n \log n)$ time. Originally, range trees answered queries in $O(\log^2 n)$
time. However, with the application of fractional cascading [6, 11] the query time can be reduced to
$O(\log n)$ without increasing the space requirement by more than a constant factor. Range trees can also
answer more general semigroup queries in which each point of S is assigned a weight from a commutative
semigroup and the goal is to report the weight of all points in the query range [10, 15].

For 2-sided orthogonal range counting queries, Chazelle [4, 5] proposes a data structure of size
$O(n)$ that can be constructed in $O(n \log n)$ time and that can answer range counting queries in $O(\log n)$
time. Unfortunately, this data structure is not capable of answering semigroup queries in the same
time bound. For semigroup queries, Chazelle provides data structures with the following requirements:
(1) $O(n)$ space and $O(\log^{2+\epsilon} n)$ query time, (2) $O(n \log\log n)$ space and $O(\log^2 n \log\log n)$ query time,
and (3) $O(n \log^{\epsilon} n)$ space and $O(\log^2 n)$ query time.

Practical linear space data structures for range counting include k-d trees [2], quad-trees [13],
and their variants. These structures are practical in the sense that they are easy to implement and use
only $O(n)$ space. Unfortunately, neither of these structures has a worst-case query time of $\log^{O(1)} n$.
Thus, in terms of query time, k-d trees and quad-trees are nowhere near competitive with range trees.

Despite the long history of data structures for orthogonal range queries, range trees with fractional
cascading are still the most effective data structure for 2-sided orthogonal range queries in the
semigroup model. In particular, no data structure is currently known that uses $o(n \log n)$ space and can
answer 2-sided orthogonal range queries in $O(\log n)$ time.



1.2 New Results 

In the current paper we present a data structure, the biased range tree, for 2-sided orthogonal range 
counting. Biased range trees fit into the comparison tree model of computation, in which all decisions 
made during a query are based on the result of comparing either the x- or y-coordinate of the query point 
to some precomputed values. Most data structures for orthogonal range searching, including range trees,
k-d trees and quadtrees, fit into the comparison tree model. This model makes no assumptions about
the x- or y-coordinates of points other than that they each come from some (possibly different) total
order. This is particularly useful in practice since it avoids the precision problems usually associated with
algebraic decisions and allows the mixing of different data types (one for x-coordinates and one for
y-coordinates) in one data structure.

A biased range tree has size $O(n \log n)$, can be constructed in $O(n \log n)$ time, and can answer
range counting (or semigroup) queries in $O(\mu_D(T^*))$ expected time, where $T^*$ is any comparison tree
that answers range counting queries over S. In particular, $T^*$ could be a comparison tree that minimizes
$\mu_D(T^*)$, implying that the expected query time of our data structure is, up to a constant factor, as fast
as the fastest comparison-based data structure for answering range counting queries over S. Moreover, the worst-case search time
of biased range trees is $O(\log n)$, matching the worst-case performance of range trees.

Note that we do not place any restrictions on the comparison tree T* . Biased range trees, while 
requiring only O(nlogn) space, are competitive with any comparison-based data structure. Thus, the 
memory requirement of biased range trees is the same as that of range trees but their expected query 
time can never be any worse. 

The remainder of the paper is organized as follows. In Section 2 we present background material
that is used in subsequent sections. In Section 3 we define biased range trees. In Section 4 we prove
that biased range trees are optimal. In Section 5 we summarize and describe directions for future
work.

2 Preliminaries



In this section we give definitions, notations, and background that are prerequisites for subsequent 
sections. 



Rectangles. For the purposes of the current paper, a rectangle R(a, b, c, d) is defined as

$$R(a, b, c, d) = \{(x, y) : a \le x \le b \text{ and } c \le y \le d\} .$$

We also allow unbounded rectangles by setting $a, c = -\infty$ and/or $b, d = \infty$. Therefore, under this
definition, rectangles can have 0, 1, 2, 3, or 4 sides. For a query point $q = (q_x, q_y)$ we denote by R(q)
the query range $R(q_x, \infty, q_y, \infty)$. A horizontal strip is a rectangle of the form $R(-\infty, \infty, c, d)$ and a vertical
strip is a rectangle of the form $R(a, b, -\infty, \infty)$.



Classification Problems and Classification Trees. A classification problem over a domain $\mathcal{D}$ is a function
$\mathcal{P} : \mathcal{D} \to \{0, \ldots, k-1\}$. The special case in which k = 2 is called a decision problem. A d-ary
classification tree is a full d-ary tree$^1$ in which each internal node v is labelled with a function
$P_v : \mathcal{D} \to \{0, \ldots, d-1\}$ and for which each leaf $\ell$ is labelled with a value in $\{0, \ldots, k-1\}$. The
search path of an input q in a classification tree T starts at the root of T and, at each internal node v,
evaluates $i = P_v(q)$ and proceeds to the ith child of v. We denote by T(q) the label of the final (leaf)
node in the search path for q. We say that the classification tree T solves the classification problem $\mathcal{P}$
over the domain $\mathcal{D}$ if, for every $q \in \mathcal{D}$, $\mathcal{P}(q) = T(q)$.

The particular type of classification trees we are concerned with are comparison trees. These are
binary classification trees in which the function $P_v$ at each node v compares either $q_x$ or $q_y$ to a fixed
value (that may depend on the point set S and the distribution D). For the problem of 2-sided range
counting over S, the leaves of T are labelled with values in $\{0, \ldots, |S|\}$ and $T(q) = |R(q) \cap S|$ for all
$q \in \mathbb{R}^2$.
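
To make the comparison tree model concrete, here is a small sketch of a comparison tree for the one-point set S = {(1, 1)}. The class names `Leaf` and `Node`, the function `evaluate`, and the convention that the low child is taken when the coordinate is at most the stored value are all illustrative assumptions, not notation from this paper.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class Leaf:
    label: int            # the value T(q) reported for queries ending here


@dataclass
class Node:
    axis: int             # 0 compares q_x, 1 compares q_y
    value: float          # the fixed value the coordinate is compared to
    low: "Tree"           # child taken when q[axis] <= value
    high: "Tree"          # child taken when q[axis] > value


Tree = Union[Leaf, Node]


def evaluate(tree, q):
    """Follow the search path of q; return (leaf label, number of comparisons)."""
    depth = 0
    while isinstance(tree, Node):
        tree = tree.low if q[tree.axis] <= tree.value else tree.high
        depth += 1
    return tree.label, depth


# Comparison tree for S = {(1, 1)}: the count is 1 exactly when
# q_x <= 1 and q_y <= 1, and 0 otherwise.
TREE = Node(0, 1.0, Node(1, 1.0, Leaf(1), Leaf(0)), Leaf(0))
print(evaluate(TREE, (0.0, 0.0)))
```

The second component of the result is the length of the search path, which is the quantity averaged over the query distribution when measuring expected search time.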



Probability. For a probability measure D and an event X, we denote by $D_{|X}$ the distribution D
conditioned on X. That is, the distribution where the probability of an event Y is $\Pr(Y \mid X) =
\Pr(Y \cap X)/\Pr(X)$. The probability measures used in this paper are usually defined over $\mathbb{R}^2$. We make
no assumptions about how these measures are represented, but we assume that an algorithm can, in 
constant time, given a rectangle r, determine Pr(r). 

For a classification tree T that solves a problem $\mathcal{P} : \mathcal{D} \to \{0, \ldots, k-1\}$ and a probability measure
D over $\mathcal{D}$, the expected search time of T, denoted by $\mu_D(T)$, is the expected length of the search path for
q when q is drawn at random from $\mathcal{D}$ according to D. Note that, for each leaf $\ell$ of T there is a maximal
subset $r(\ell) \subseteq \mathcal{D}$ such that the search path for any $q \in r(\ell)$ ends at $\ell$. Thus, the expected search time of
T (under distribution D) can be written as

$$\mu_D(T) = \sum_{\ell \in L(T)} \Pr(r(\ell)) \times d_T(\ell) ,$$

where L(T) denotes the leaves of T and $d_T(\ell)$ denotes the length of the path from the root of T to $\ell$.
When the tree T is obvious based on context we will sometimes use the notation $d(\ell)$ to denote $d_T(\ell)$.

1 A full d-ary tree is a rooted ordered tree in which each non-leaf node has exactly d children. 






Note that, for comparison trees, the closure of $r(\ell)$ is always a rectangle. For a node v in a tree, we will
use the phrases depth of v and level of v interchangeably and they both refer to d(v).

The following theorem is a restatement of (half of) Shannon's Fundamental Theorem for a 
Noiseless Channel [14, Theorem 9]. 

Theorem 1. Let $\mathcal{P} : \mathcal{D} \to \{0, \ldots, k-1\}$ be a classification problem and let $p \in \mathcal{D}$ be selected from a
distribution D such that $\Pr\{\mathcal{P}(p) = i\} = p_i$, for $0 \le i < k$. Then, any d-ary classification tree T that solves
$\mathcal{P}$ has

$$\mu_D(T) \ge \sum_{i=0}^{k-1} p_i \log_d(1/p_i) . \qquad (1)$$

In terms of range counting, Theorem 1 immediately implies that, if $p_i$ is the probability that
the query range contains i points of S, then any binary decision tree T that does range counting has
$\mu_D(T) \ge \sum_{i=0}^{n} p_i \log(1/p_i)$. Unfortunately for us, this lower bound is too weak and, in general, there is
no decision tree whose performance matches this obvious entropy lower bound.
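
The entropy bound of Theorem 1 and the expected search time of a given tree are both simple sums, so they are easy to compare numerically. The sketch below uses hypothetical helper names (`entropy`, `expected_depth`) and a made-up three-leaf binary tree; it only illustrates the two formulas.

```python
import math


def entropy(probs, base=2):
    """Right-hand side of the entropy bound: sum of p_i * log_base(1 / p_i)."""
    return sum(p * math.log(1 / p, base) for p in probs if p > 0)


def expected_depth(leaves):
    """Expected search time, given (probability, depth) pairs, one per leaf."""
    return sum(p * d for p, d in leaves)


# A binary tree with three leaves reached with probabilities 1/2, 1/4, 1/4
# at depths 1, 2, 2. Here the tree meets the entropy bound exactly.
probs = [0.5, 0.25, 0.25]
leaves = [(0.5, 1), (0.25, 2), (0.25, 2)]
print(entropy(probs), expected_depth(leaves))
```

In this toy instance the two quantities coincide; the discussion that follows in the text concerns range counting instances where no comparison tree comes close to the bound.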

A stronger lower bound on the cost of range searching can be obtained by considering the
arrangement A of 2n rays obtained by drawing two rays originating at each point of S, one to the left
and one downwards (see Figure 1a). This arrangement partitions the plane into a set of faces F(A).
If T is a comparison tree for range counting in S, then there is no leaf $\ell$ of T such that the interior
of $r(\ell)$ intersects any edge of A since otherwise there are query points q in the neighbourhood of this
intersection for which $T(q) \ne |R(q) \cap S|$. Therefore, by relabelling the leaves of T with the faces of A,
we obtain a data structure for determining which face of A contains the query point q. By Theorem 1
this implies that

$$\mu_D(T) \ge \sum_{f \in F(A)} \Pr(f) \log(1/\Pr(f)) .$$

Unfortunately, this bound is still not strong enough and, in general, there is no decision tree T that
matches this lower bound. To see this, consider Figure 1b, when the query point q is uniformly distributed
among the n + 1 shaded circles. In this case, q is always in the same face of A so the lower
bound given above is 0. Nevertheless, it is not hard to see that the leaves of any decision tree T for range
searching in S can be relabelled to determine which of the n + 1 circles contains q, so $\mu_D(T) \ge \log(n+1)$.



Biased Search Trees. Biased search trees are a classic data structure for solving the following 1-dimensional
problem: Given an increasing sequence of real numbers $X = \langle x_0 = -\infty, x_1, x_2, \ldots, x_n, x_{n+1} = \infty \rangle$
and a probability distribution D over $\mathbb{R}$, construct a binary search tree T = T(X, D) so that, for any
query value q drawn from D, one can quickly find the unique interval $[x_i, x_{i+1})$ containing q. If $p_i$ is the
probability that $q \in [x_i, x_{i+1})$ then the expected number of comparisons performed while searching for
q is given by

$$\mu_D(T) \le \sum_{i=1}^{n} p_i \log(1/p_i) + 1$$

and the tree T can be constructed in $O(n)$ time [12]. Clearly, by Theorem 1, the query time of this binary
search tree is optimal up to an additive constant term. Note that, by having each node of T store the size
of its subtree, a biased search tree can count the number of elements of X in the interval $I(q) = [q, \infty)$
without increasing the search time by more than a constant factor. Thus, biased search trees are an
optimal data structure for 1-dimensional range counting.
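
The idea behind a biased search tree can be sketched with a simple weight-bisection rule: at every node, split the remaining intervals so that the two sides have weights as equal as possible. This is not the linear-time construction cited above; it is an illustrative heuristic that may take quadratic time as written, and the names `build` and `locate` are hypothetical.

```python
from itertools import accumulate


def build(weights, lo=0, hi=None, prefix=None):
    """Build a search tree over the intervals lo..hi as nested tuples.

    weights[i] is the weight of interval i (interval 0 lies left of the
    first key, interval i starts at the i-th key). A node ("node", s, L, R)
    sends intervals lo..s-1 to L and intervals s..hi to R.
    """
    if hi is None:
        hi = len(weights) - 1
        prefix = [0] + list(accumulate(weights))
    if lo == hi:
        return ("leaf", lo)
    total = prefix[hi + 1] - prefix[lo]
    # Pick the split whose left weight is closest to half of the total.
    s = min(range(lo + 1, hi + 1),
            key=lambda s: abs(2 * (prefix[s] - prefix[lo]) - total))
    return ("node", s,
            build(weights, lo, s - 1, prefix),
            build(weights, s, hi, prefix))


def locate(tree, keys, q):
    """Return (index of the interval containing q, number of comparisons)."""
    depth = 0
    while tree[0] == "node":
        _, s, left, right = tree
        tree = left if q < keys[s - 1] else right   # keys[s - 1] starts interval s
        depth += 1
    return tree[1], depth


keys = [10, 20, 30]
tree = build([7, 1, 1, 1])          # interval 0 carries most of the weight
print(locate(tree, keys, 5), locate(tree, keys, 35))
```

Heavy intervals end up near the root, so a query drawn from the distribution is usually resolved after few comparisons. If interval i contains q, then exactly len(keys) - i keys are strictly greater than q, which is the kind of count the subtree sizes mentioned above provide.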

Figure 1: (a) The distribution of the query point q over the faces of the arrangement A gives a lower
bound on the cost of any comparison tree for range counting in S. (b) The lower bound is not always
achievable by a comparison tree.

3 Biased Range Trees 



In this section we describe the biased range tree data structure, which has three main parts: the backup 
tree, the primary tree, and a set of catalogues that adorn the nodes of the primary tree. 



3.1 The Backup Tree 

In trying to achieve optimal query time, biased range trees will try to quickly answer queries that are,
in some sense, easy. In some cases, a query is difficult and it cannot be answered in $o(\log n)$ time. For
these queries, a backup range tree that stores the points of S and can answer any 2-sided range query in
$O(\log n)$ worst-case time is used. The preprocessing time and space requirements of this backup tree are
$O(n \log n)$.



3.2 The Primary Tree 

Like a range tree, a biased range tree is an augmented data structure consisting of a primary tree whose 
nodes store secondary structures. However, in a range tree the primary tree is a binary search tree that 
discriminates based only on the x-coordinate of the query point q. In order to achieve optimal expected 
query time, this turns out to be insufficient, so instead biased range trees use a variation of a k-d tree as
the primary tree.

The primary tree is constructed in a top-down fashion. Each node v of T is associated with a
region r(v) whose closure is a rectangle. The region associated with the root of T is all of $\mathbb{R}^2$. We say
that a node v is bad if its depth is at least $\lceil \log_2 n \rceil$ and $r(v) \cap S \ne \emptyset$. A node v is split if its depth is less
than $\lceil \log_2 n \rceil$ and $r(v) \cap S \ne \emptyset$. The two children of a split node v are associated with the two regions
obtained by removing a vertical or horizontal strip s(v) from r(v) depending on whether the depth of v
is even or odd, respectively. We call a node v at even distance from the root a vertical node, otherwise
we call v a horizontal node.

Figure 2: The splitting of (a) a vertical node v and (b) a horizontal node v.

Refer to Figure 2. For a vertical node v, we denote its children by left(v) and right(v) and call
them the left child and right child of v, depending on which side of the vertical strip (left or right) they
are. For uniformity we will also call the children of a node v that is split with a horizontal strip left(v)
and right(v). The child below the strip is denoted by left(v) and the child above the strip is denoted by
right(v). Similarly, the left and right boundaries of a strip s(v) at a horizontal node v refer to the bottom
and top sides of s(v). Note that, with these conventions, if the query point q is in r(left(v)) then R(q)
intersects r(right(v)). However, if $q \in r(\mathrm{right}(v))$ then R(q) does not intersect r(left(v)). Similarly, for a
query point $q \in s(v)$, the query range R(q) intersects r(right(v)) but not r(left(v)).

All that remains is to define the strip s(v) for each node v. If v is a leaf then we use the
convention that s(v) = r(v). If v is not a leaf then $s(v) \subseteq r(v)$ is selected as a maximal strip containing
no point of $r(v) \cap S$ in its interior, that is closed on its right side and open on its left side and such
that each of the at most two components of $r(v) \setminus s(v)$ has probability at most $\Pr(r(v))/2$. Suppose v
is a vertical node. Then let $r(v)_1, \ldots, r(v)_k$ be a partitioning of r(v) into strips, in left-to-right order,
obtained by drawing a vertical line through each of the k points in $S \cap r(v)$. We use the convention that
each strip is closed on its right side and open on its left side. Then there is a unique strip $s(v) = r(v)_i$
such that $\sum_{j=1}^{i-1} \Pr(r(v)_j) \le \Pr(r(v))/2$ and $\sum_{j=i+1}^{k} \Pr(r(v)_j) \le \Pr(r(v))/2$. For a horizontal node v, the
definition of s(v) is analogous except we use horizontal lines through each point of $r(v) \cap S$.
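
The choice of the splitting strip amounts to finding a weighted median among the strips of r(v). A minimal sketch, assuming the strip probabilities are already available as a list in left-to-right (or bottom-to-top) order; the function name `select_strip` and the 0-based indexing are illustrative assumptions.

```python
def select_strip(strip_probs):
    """Return the 0-based index i of the strip to use as s(v).

    strip_probs[j] is the probability of the j-th strip of r(v). The strips
    before index i, and the strips after it, each carry at most half of the
    total probability of r(v).
    """
    total = sum(strip_probs)
    running = 0
    for i, p in enumerate(strip_probs):
        running += p
        # First strip at which the cumulative probability reaches half the total.
        if 2 * running >= total:
            return i
    raise ValueError("strip_probs must be non-empty")


print(select_strip([1, 2, 5, 2]))
```

With weights 1, 2, 5, 2 (total 10), the third strip is selected: the strips to its left weigh 3 and the strip to its right weighs 2, both at most 5. When several strips satisfy both inequalities, this sketch returns the leftmost one.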

Note that for a node v that is not a leaf, we use the convention that s(v) contains its right side
but not its left side and that r(right(v)) and r(left(v)) are the two components of $r(v) \setminus s(v)$. This implies
that r(left(v)) and/or r(right(v)) may be empty, in which case left(v), respectively, right(v) is a leaf of T.
With these definitions, for any point $q \in \mathbb{R}^2$ there is exactly one vertex v of T such that $q \in s(v)$.

The following two properties are easily derived from the definition of T and are necessary to 
prove the optimality of biased range trees: 

1. Any node v at depth i in T has $\Pr(s(v)) \le \Pr(r(v)) \le 1/2^i$.

2. For any node v of T, if $\Pr(r(v)) > 0$, then the closure of r(v) contains at least one point of S.

Figure 3: The catalogues of (a) a horizontal node v and (b) a vertical node v.



Point 1 above follows immediately from the definition of s(v). Next we explain the logic leading
to Point 2. If r(v) contains a point of S then so does the closure of r(v). If $r(v) = \emptyset$, then $\Pr(r(v)) = 0$.
Otherwise, $r(v) \ne \emptyset$ and r(v) has no point of S in its interior. Then consider the parent w of v. Since s(w)
does not contain r(v) there must be a point of S on the boundary of s(w) that is also on the boundary of
r(v). Therefore r(v) contains this point in its closure.



3.3 The Catalogues 



The nodes of the tree T are augmented with additional data structures called catalogues that hold subsets
of S. Each node v has two catalogues, $C_x(v)$ and $C_y(v)$, that store subsets of S sorted by their x-,
respectively, y-, coordinate. Intuitively, $C_x(v)$ stores points that are "above" r(v) and $C_y(v)$ stores points
that are "to the right of" r(v). (Refer to Figure 3.) More precisely, if v is a horizontal node, then
$C_x(\mathrm{left}(v)) = (s(v) \cup r(\mathrm{right}(v))) \cap S$ and $C_y(\mathrm{left}(v)) = \emptyset$. If v is a vertical node, then $C_y(\mathrm{left}(v)) =
(s(v) \cup r(\mathrm{right}(v))) \cap S$ and $C_x(\mathrm{left}(v)) = \emptyset$. For any node v that is the root of T or a right child of its
parent, $C_x(v) = C_y(v) = \emptyset$.

Consider any node v that is not a bad leaf and any point $q \in s(v)$. If v has a left child then let
$v_1 = \mathrm{left}(v)$, otherwise, let $v_1 = v$. Let $v_1, \ldots, v_k$ denote the path from $v_1$ to the root of T (see Figure 4).
Then the catalogues of $v_1, \ldots, v_k$ have the following properties:

1. The points in the catalogues of $v_1, \ldots, v_k$ are above or to the right of q. That is, for each $1 \le i \le k$,
all points in $C_y(v_i)$, respectively, $C_x(v_i)$ have their x-, respectively, y-, coordinate greater than or
equal to $q_x$, respectively, $q_y$.

2. All catalogues at nodes in $v_1, \ldots, v_k$ are disjoint. That is, for each $1 \le i < j \le k$, $C_x(v_i) \cap C_x(v_j) = \emptyset$,
$C_y(v_i) \cap C_y(v_j) = \emptyset$, $C_x(v_i) \cap C_y(v_j) = \emptyset$, and $C_x(v_j) \cap C_y(v_i) = \emptyset$.

3. The catalogues at nodes $v_1, \ldots, v_k$ contain all points in the query range R(q). That is,

$$R(q) \cap S \subseteq \bigcup_{i=1}^{k} \bigl( C_x(v_i) \cup C_y(v_i) \bigr) .$$

Figure 4: The area covered by catalogues on the path from v to the root of T. The × symbol shows the
location of the query point q.

Note that points 1, 2 and 3 above imply that determining $|R(q) \cap S|$ can be done by solving a
sequence of 1-sided range queries in the x- and y-catalogues of $v_1, \ldots, v_k$. However, performing these
queries individually would take too long.
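
The reduction to 1-sided queries can be sketched as follows, with each catalogue represented simply as a sorted list of the relevant coordinate. This is the slow version described above, with one independent binary search per catalogue and no fractional cascading; the function name and the list-of-lists representation are illustrative assumptions.

```python
from bisect import bisect_left


def count_from_catalogues(x_catalogues, y_catalogues, q):
    """Sum the answers of 1-sided queries over disjoint catalogues.

    x_catalogues: sorted lists of x-coordinates; by property 1 every point
        stored there already has y >= q_y, so only x >= q_x must be checked.
    y_catalogues: sorted lists of y-coordinates; every point stored there
        already has x >= q_x, so only y >= q_y must be checked.
    """
    qx, qy = q
    total = 0
    for xs in x_catalogues:
        total += len(xs) - bisect_left(xs, qx)   # entries with x >= qx
    for ys in y_catalogues:
        total += len(ys) - bisect_left(ys, qy)   # entries with y >= qy
    return total


print(count_from_catalogues([[1, 4, 6], [5]], [[2, 3, 9]], (4, 3)))
```

Because the catalogues are disjoint (property 2) and cover the query range (property 3), the partial counts can simply be added. Each binary search costs time logarithmic in the catalogue size, which is the per-catalogue cost that fractional cascading is meant to avoid.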

To speed up the process of navigating the catalogues of T, fractional cascading [6] is used.
Starting at the root of T and as long as v is not a leaf, a fraction of the data in $C_x(v)$ is cascaded into
$C_x(\mathrm{right}(v))$ and $C_x(\mathrm{left}(v))$. As well, a fraction of the data in $C_y(v)$ is cascaded into both $C_y(\mathrm{right}(v))$
and $C_y(\mathrm{left}(v))$. Note that this cascading is done only to speed up navigation between the catalogues
of T. Although fractional cascading introduces extra data into the catalogues of T we will continue to
use the notations $C_x(v)$ and $C_y(v)$ to denote the set of points contained in the catalogues of v before
fractional cascading takes place.

Finally, each catalogue $C_x(v)$ and $C_y(v)$ is indexed by a biased binary search tree $T_x(v)$, respectively,
$T_y(v)$. If v is the left child of its parent, then the weight of an interval (a, b] in $T_x(v)$, respectively,
$T_y(v)$ is given by the probability that $q_x$, respectively, $q_y$, is in the interval (a, b] when q is drawn according
to the distribution $D_{|s(\mathrm{parent}(v))}$. Otherwise (v is not a left child), the weight of an interval is
determined by the distribution $D_{|s(v)}$.

3.4 Construction Time and Space Requirements

The biased range tree data structure is now completely defined. The structure consists of a backup tree,
a primary tree, and the catalogues of the primary tree. We now analyze the construction time and space 
requirements of biased range trees. 

The backup tree has size $O(n \log n)$ and can be constructed in $O(n \log n)$ time [8, Theorem 5.11].
To construct the primary tree quickly we presort the points of S by their x- and y-coordinates. Since the
primary tree has height $O(\log n)$, it is then easily constructed in $O(n \log n)$ time. Ignoring any copies of
points created by fractional cascading, each point in S occurs in at most 2 catalogues at each level of
the primary tree. Thus, the total size of all catalogues (before fractional cascading) is $O(n \log n)$ and these
catalogues can be constructed in $O(n \log n)$ time (because the elements of S are presorted; see de Berg et
al. [8, Section 5.3] for details). The fractional cascading between catalogues does not increase the size of
catalogues by more than a constant factor since each catalogue is cascaded into only a constant number
of other catalogues [6].

In summary, given the point set S and access to the distribution D, a biased range tree for (S, D)
can be constructed in $O(n \log n)$ time and requires $O(n \log n)$ space.



3.5 The Query Algorithm 

The algorithm to answer a 2-sided range query $q = (q_x, q_y)$ proceeds in three steps:

1. The algorithm navigates the tree T from top to bottom to locate the unique node v such that
$q \in s(v)$. This step takes $O(d_T(q))$ time, where $d_T(q)$ is the depth of the node v. If v is a bad leaf
(so $d_T(q) \ge \log n$) then the algorithm performs a range query in $O(\log n)$ time using the backup
range tree and the query algorithm does not execute the next two steps.

2. If v has a left child then let u = left(v), otherwise let u = v. The algorithm uses $T_x(u)$ and $T_y(u)$ to
locate $q_x$ and $q_y$, respectively, in the catalogues $C_x(u)$ and $C_y(u)$, respectively.

3. The algorithm walks back from u to the root of T, locating q in the catalogues of all nodes on
this path and computing the result of the range counting query as it goes. Thanks to fractional
cascading, each step of this walk can be done in constant time, so the overall time for this step is
also $O(d_T(q))$.

Observe that Steps 1 and 3 of the query algorithm each take $O(d_T(q))$ time. The time needed to
accomplish Step 2 of the algorithm depends on exactly what is in the catalogues $C_x(u)$ and $C_y(u)$, and
will be the first quantity we study in the next section.



4 Optimality of Biased Range Trees 



In this section we show that the expected query time of biased range trees is as good as the expected 
query time of any comparison tree. The expected query time has two components. The first component 
is the expected depth, $d_T(q)$, of the node v such that s(v) contains q. The second component is the
expected cost of locating q in the catalogues of u (recall that u = left(v), or u = v if v has no left child).
We will show that each of these two components is a lower bound on the expected cost of any decision
tree for two-sided range searching on S where queries come from distribution D. In order to simplify
notation in this section we will use the convention that Pr(v) = Pr(s(v)) is the probability that a search
terminates at node v of T.



4.1 The Catalogue Location Step 



First we show that the expected cost of locating q in the two catalogues, $C_x(u)$ and $C_y(u)$, is a lower
bound on the expected cost of any decision tree for answering 2-sided range queries in S. The intuition 
behind this proof is that, in order to correctly answer range counting queries, any decision tree for 
range counting must locate the x-coordinate of q with respect to the x-coordinates of all points above q. 
Similarly, it must locate the y-coordinate of q with respect to the y-coordinates of all points to the right 
of q. The structure of the catalogues ensures that biased range trees do this in the most efficient manner 
possible. 

Lemma 1. Let S be a set of n points and let D be a probability measure over $\mathbb{R}^2$. Let $T^*$ be any decision
tree for 2-sided range counting in S and let $C_2(S, D)$ denote the expected cost of locating q in Step 2 of the
biased range tree query algorithm on the biased range tree T = T(S, D). Then

$$\mu_D(T^*) = \Omega(C_2(S, D)) .$$



Proof. We first observe that, by definition,

$$C_2(S, D) = \sum_{v \in T} \Pr(v) \left( \mu_{D_{|s(v)}}(T_x(u)) + \mu_{D_{|s(v)}}(T_y(u)) \right) .$$

Consider some node v of T. For a point $q \in s(v)$, all of the points in $T_x(v)$ are points that may or may
not be in the query range R(q) depending on where exactly q is located within s(v). This implies that,
if $T^*$ correctly answers range queries for every point $q \in s(v)$ then it must determine the location of the
x-coordinate of q with respect to all points in $T_x(v)$. More precisely, the leaves of $T^*$ could be relabelled
to obtain a comparison tree that determines, for any $q \in s(v)$, which interval of $T_x(v)$ contains $q_x$. Since
$T_x(u)$ is a biased search tree for the probability measure $D_{|s(v)}$, this implies that

$$\mu_{D_{|s(v)}}(T^*) \ge \mu_{D_{|s(v)}}(T_x(u)) - 1 .$$

Similarly, the same argument applied to $T_y(v)$ yields

$$\mu_{D_{|s(v)}}(T^*) \ge \mu_{D_{|s(v)}}(T_y(u)) - 1 .$$

We can now complete the proof with

$$
\begin{aligned}
\mu_D(T^*) &= \sum_{v \in T} \Pr(v) \cdot \mu_{D_{|s(v)}}(T^*) \\
&\ge \sum_{v \in T} \Pr(v) \cdot \max\left\{ \mu_{D_{|s(v)}}(T_x(u)),\ \mu_{D_{|s(v)}}(T_y(u)) \right\} - 1 \\
&\ge \frac{1}{2} \cdot C_2(S, D) - 1 = \Omega(C_2(S, D)) .
\end{aligned}
$$

□

4.2 The Tree Searching Step



Next we bound the expected depth $d_T(q)$ of the node v of T such that $q \in s(v)$. We do this by showing
that any decision tree $T^*$ for range counting in S must solve a set of point location problems and that
the expected depth of v is a lower bound on the complexity of solving these problems.

We say that a set of rectangles is HV-independent if no horizontal or vertical line intersects more
than one rectangle in the set. We say that a set $\{v_1, \ldots, v_k\}$ of nodes in T is HV-independent if the set
$\{r(v_1), \ldots, r(v_k)\}$ is HV-independent.

Lemma 2. Let S be a set of n points and let D be a probability measure over $\mathbb{R}^2$. Let T = T(S, D) be
the biased range tree for (S, D) and label each node of T white or black, such that all white nodes are at
distance at most i from the root of T. Then, if T contains more than $\gamma^i$ white nodes then T contains an
HV-independent set of white nodes of size $\Omega((\gamma/\sqrt{2})^i)$.

Proof. Define a graph G = (V, E) whose vertices are the white nodes of T and for which $uv \in E$ if and
only if there is a horizontal or vertical line that intersects both r(u) and r(v). Note that an independent
set of vertices in G is an HV-independent set of white nodes in T. Thus, it suffices to find a sufficiently
large independent set in G.

A well-known result on k-d trees states that, for a k-d tree of height i, any horizontal or vertical
line intersects at most $2^{\lceil i/2 \rceil}$ rectangles of the k-d tree [8, Lemma 5.4]. Therefore, since T is a k-d tree$^2$,
the number of edges in G is at most $|V| \cdot 2^{\lceil i/2 \rceil}$. This implies that G has a vertex v of degree at most
$2^{\lceil i/2 \rceil + 1}$ and this is also true of any vertex-induced subgraph of G.

We can therefore obtain an independent set in G by repeatedly selecting a vertex v of degree at most
$2^{\lceil i/2 \rceil + 1}$, adding v to the independent set and deleting v and its neighbours from G. Since, at each step,
we add one vertex to the independent set and delete at most $2^{\lceil i/2 \rceil + 1} + 1$ vertices from G, this produces
an independent set of size $\Omega(|V|/2^{i/2}) = \Omega((\gamma/\sqrt{2})^i)$, as required. □
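
The greedy argument used in this proof can be sketched on an arbitrary graph. The version below always picks a vertex of minimum current degree, which in particular has degree at most any bound that holds for every induced subgraph; the function name and the dictionary-of-sets representation are illustrative assumptions.

```python
def greedy_independent_set(adj):
    """Greedily build an independent set.

    adj maps each vertex to the set of its neighbours (undirected graph).
    At each step a minimum-degree vertex is added to the result, then it
    and its neighbours are deleted from the graph.
    """
    adj = {v: set(ns) for v, ns in adj.items()}   # work on a private copy
    chosen = []
    while adj:
        # Ties are broken by vertex label to keep the result deterministic.
        v = min(adj, key=lambda u: (len(adj[u]), u))
        chosen.append(v)
        removed = adj[v] | {v}
        for u in removed:
            adj.pop(u, None)
        for ns in adj.values():
            ns -= removed
    return chosen


# Path graph 0 - 1 - 2 - 3.
path = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}
print(greedy_independent_set(path))
```

If every induced subgraph has a vertex of degree at most d, each round removes at most d + 1 vertices, so the result has at least |V|/(d + 1) vertices, which is the counting step of the proof.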

We can now provide the second piece of the lower bound. 

Lemma 3. Let S be a set of n points and let D be a probability measure over $\mathbb{R}^2$. Let $T^*$ be any comparison
tree that does range counting over S. Let $C_1(S, D)$ denote the expected depth of the node v of the biased
range tree T = T(S, D) such that $q \in s(v)$. Then

$$\mu_D(T^*) = \Omega(C_1(S, D)) .$$

Proof. Partition the nodes of T into groups $G_1, G_2, \ldots$ where $G_i$ contains all nodes v such that $1/2^i \le
\Pr(v) < 1/2^{i-1}$. Observe that the nodes in group $G_i$ occur in the first i levels of T. Select constants $\gamma$
and $\beta$ with $\sqrt{2} < \gamma < \beta < 2$ and define $\alpha = \gamma/\sqrt{2}$. By repeatedly applying Lemma 2, each group $G_i$ can
be partitioned into groups $G_{i,1}, \ldots, G_{i,t_i}$ where, for each $1 \le j < t_i$, $G_{i,j}$ is an HV-independent set with
$|G_{i,j}| \ge \alpha^i$. Furthermore, $|G_{i,t_i}| \le \gamma^i$. (Note that $G_{i,t_i}$ is not necessarily HV-independent.)

2 Although T is not exactly a k-d tree as described in Reference [8], the proof found there still holds.

Consider some group $G_{i,j}$ for $1 \le j < t_i$. Let $\ell$ be a leaf of $T^*$ and observe that, because the
nodes in $G_{i,j}$ are independent and each one contains at least one point of S in its closure, there are at
most 4 nodes v in $G_{i,j}$ such that $r(\ell)$ intersects the closure of r(v). (Otherwise $r(\ell)$ contains a point of S
in its interior and therefore $T^*$ does not solve the range counting problem for S.) Thus, by performing
2 additional comparisons, $T^*$ can be used to determine which node $v \in G_{i,j}$ (if any) contains the
query point q in s(v). However, $G_{i,j}$ contains $\Omega(\alpha^i)$ nodes and the search path for q terminates at each
of these with probability between $1/2^i$ and $1/2^{i-1}$. Therefore, if we denote by $D_{i,j}$ the distribution D
conditioned on the search path for q terminating in one of the nodes in $G_{i,j}$ then we have, by applying
Theorem 1,

$$
\begin{aligned}
\mu_{D_{i,j}}(T^*) + 2 &\ge \sum_{v \in G_{i,j}} \Pr(v \mid G_{i,j}) \log(1/\Pr(v \mid G_{i,j})) \\
&\ge \sum_{v \in G_{i,j}} \Pr(v \mid G_{i,j}) \log(\Omega(\alpha^i)) \\
&\ge \log(\Omega(\alpha^i)) \\
&= i \log \alpha - O(1) .
\end{aligned}
$$

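Putting this all together, we obtain

$$
\begin{aligned}
\mu_D(T^*) &= \sum_{i=1}^{\infty} \sum_{j=1}^{t_i} \Pr(G_{i,j}) \, \mu_{D_{i,j}}(T^*) \\
&\ge \sum_{i=1}^{\infty} \sum_{j=1}^{t_i - 1} \Pr(G_{i,j}) \, \mu_{D_{i,j}}(T^*) \\
&\ge \sum_{i=1}^{\infty} \sum_{j=1}^{t_i - 1} \Pr(G_{i,j}) \, (i \log \alpha - O(1)) \\
&\ge (\log \alpha) \cdot \sum_{i=1}^{\infty} \sum_{j=1}^{t_i - 1} \sum_{v \in G_{i,j}} \Pr(v) \cdot d(v) - O(1) \\
&= (\log \alpha) \cdot \Bigl( \sum_{v \in T} \Pr(v) \cdot d(v) - \sum_{i=1}^{\infty} \sum_{v \in G_{i,t_i}} \Pr(v) \cdot d(v) \Bigr) - O(1) \\
&\ge (\log \alpha) \cdot \sum_{v \in T} \Pr(v) \cdot d(v) - \sum_{i=1}^{\infty} i \cdot \Pr(G_{i,t_i}) - O(1) \\
&\ge (\log \alpha) \cdot \sum_{v \in T} \Pr(v) \cdot d(v) - \sum_{i=1}^{\lceil \log n \rceil} i \gamma^i / 2^{i-1} - O(1) \\
&\ge (\log \alpha) \cdot \sum_{v \in T} \Pr(v) \cdot d(v) - O(1) \\
&= \Omega(C_1(S, D)) ,
\end{aligned}
$$

where the last inequality follows from the fact that $\gamma/2 < 1$. □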

To get some idea of the constants involved in the proof of Lemma 3 we can select $\gamma = 1.6$, so that
$\alpha = 1.6/\sqrt{2} \approx 1.13137085$ and $\log \alpha \approx 0.178071905$ and the O(1) term is approximately 20. Thus, for this
choice of parameters, the depth in T is competitive with $T^*$ to within a factor of $1/0.178071905 \approx 5.615$
and an additive constant of 20. Alternatively, selecting $\gamma = 1.8$ gives a constant factor less than 3 and an
additive term of approximately 90.
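
These constants are easy to recompute. The sketch below assumes that the additive term is dominated by the series $\sum_{i \ge 1} i (\gamma/2)^i$, whose closed form is $x/(1-x)^2$ with $x = \gamma/2$; this is an assumption made here because it reproduces the values 20 and 90 quoted above, not something stated explicitly in the text. The function name `depth_constants` is hypothetical.

```python
import math


def depth_constants(gamma):
    """Return (alpha, log2(alpha), 1 / log2(alpha), additive series value)."""
    alpha = gamma / math.sqrt(2)
    log_alpha = math.log2(alpha)
    x = gamma / 2
    # Closed form of sum_{i >= 1} i * x**i, valid for 0 <= x < 1.
    series = x / (1 - x) ** 2
    return alpha, log_alpha, 1 / log_alpha, series


for gamma in (1.6, 1.8):
    print(gamma, depth_constants(gamma))
```

Moving $\gamma$ towards 2 improves the multiplicative factor but makes the series, and hence the additive term, grow without bound, which is the trade-off visible in the two parameter choices above.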
And now the main event:

Theorem 2. Let $S$ be a set of $n$ points and let $D$ be a probability measure over $\mathbb{R}^2$. Let $T = T(S, D)$ be the
biased range tree for $S$ and $D$ and let $T^*$ be any decision tree that answers range counting queries for $S$.
Then

\[ \mu_D(T^*) = \Omega(\mu_D(T)) . \]

Proof. By the definition of $C_1$ and $C_2$, the expected cost of searching in $T$ is $\mu_D(T) = O(C_1(S,D) +
C_2(S,D))$. On the other hand, by Lemma 3 and Lemma 1, $\mu_D(T^*) = \Omega(\max\{C_1(S,D), C_2(S,D)\}) =
\Omega(C_1(S,D) + C_2(S,D)) = \Omega(\mu_D(T))$, where the middle step uses $\max\{a, b\} \ge (a + b)/2$. This completes the proof. □



5 Summary, Discussion, and Conclusions 



We have presented biased range trees, an optimal data structure for 2-sided orthogonal range counting
queries when the point set $S$ and query distribution $D$ are known in advance. The expected time required
to answer queries with a biased range tree, when the queries are distributed according to $D$, is within
a constant factor of that of any decision tree for answering range queries over $S$. Like standard range trees,
biased range trees use $O(n \log n)$ space and can also answer semigroup queries [10, 15].³ Although the
analysis of biased range trees is complicated, their implementation is not much more complicated than
that of standard range trees.

As a small optimization, the backup range tree data structure can be eliminated from biased
range trees. Instead, once the probability of a node $v$ drops below $1/n$ the node can be split by ignoring
the distribution $D$ and simply splitting the points of $r(v) \cap S$ into two sets of roughly equal size. This
results in a tree of depth at most $2(\log n + 1)$.
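
A minimal Python sketch of the fallback step, under stated assumptions: the function name `balanced_depth` is hypothetical, and only the distribution-oblivious halving is modelled, not the biased part of the tree. Halving m points until single points remain takes ceil(log2 m) levels, so a subtree built this way below a low-probability node adds at most log n + 1 levels, which is consistent with one of the two (log n + 1) terms in the depth bound.

```python
import math

def balanced_depth(m):
    """Levels needed to split m points into singletons by repeated halving."""
    depth = 0
    while m > 1:
        m = (m + 1) // 2  # follow the larger of the two roughly equal halves
        depth += 1
    return depth

n = 1000
print(balanced_depth(n), math.log2(n) + 1)
```

Integer arithmetic is used for the halving so that the count stays exact for large n.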

This work is just one of many possible results on distribution-sensitive range searching. Several 
open problems immediately arise. 

Open Problem 1. Are there efficient distribution-sensitive data structures for 3-sided and 4-sided
orthogonal range counting queries?

Note that a 4-sided orthogonal range counting query can be reduced to four 2-sided orthogonal
range counting queries using the principle of inclusion-exclusion. Unfortunately, this reduction does not
produce an optimal distribution-sensitive data structure. To see this, consider 4-sided queries consisting
of unit squares whose bottom left corner is uniformly distributed in the shaded region of Figure 5. All
such queries contain no points in the query region and all such queries can be answered in $O(1)$ time
by simply checking that all four corners of the square are to the left of the point set. However, when we
decompose these queries into four 2-sided queries we obtain 2-sided queries that require $\Omega(\log n)$ time
to be answered.
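
The inclusion-exclusion reduction mentioned above can be sketched in Python as follows. This is an illustration under assumptions: `dominance_count` is a brute-force stand-in for any 2-sided structure, the 2-sided query is taken to count points with p_x >= q_x and p_y >= q_y, and the 4-sided range is taken to be half-open, [x1, x2) x [y1, y2). None of these names come from the paper.

```python
def dominance_count(points, qx, qy):
    """2-sided query: number of points (px, py) with px >= qx and py >= qy (brute force)."""
    return sum(1 for px, py in points if px >= qx and py >= qy)

def four_sided_count(points, x1, x2, y1, y2):
    """Count points in [x1, x2) x [y1, y2) using four 2-sided queries."""
    return (dominance_count(points, x1, y1)
            - dominance_count(points, x2, y1)   # remove points with px >= x2
            - dominance_count(points, x1, y2)   # remove points with py >= y2
            + dominance_count(points, x2, y2))  # add back points removed twice

points = [(1, 1), (2, 5), (3, 3), (6, 2), (7, 7)]
print(four_sided_count(points, 2, 7, 2, 6))
```

Each of the four corner queries is answered independently of the others, which is why the distribution induced on the 2-sided structure can be much harder than the original 4-sided one.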

Open Problem 2. Biased range trees require that the point set $S$ and the distribution $D$ be known in
advance. Is there a self-adapting version of biased range trees that, without knowing $D$ in advance, can
answer $m$ queries, each drawn independently from $D$, in $O(n \log n + m \mu_D(T^*))$ expected time?

3 That biased range trees can answer semigroup queries follows from Properties 1-3 of the catalogues in Section liOl 






Figure 5: Decomposing a 4-sided query into four 2-sided queries can produce a bad distribution of 
2-sided queries. 

Open Problem 3. Determine the worst-case or the average-case constants associated with 2-dimensional
orthogonal range searching for comparison-based data structures. By applying the result of Adamy and
Seidel [1] on point location to the arrangement $A$ described in Section 2, one immediately obtains an $O(n^2)$
space data structure that answers queries using at most $2 \log n + O(\log\log n)$ comparisons. Is there an
$O(n \log n)$ space structure with the same performance?

Open Problem 4. A point $q \in \mathbb{R}^d$ is maximal with respect to $S \subset \mathbb{R}^d$ if no point of $S$ has every coordinate
larger than the corresponding coordinate of $q$. For $d \ge 3$, is there a distribution-sensitive data structure
for testing if a query point $q$ is maximal? For point sets in 2 dimensions, an orthogonal variant of the
point-location techniques of Collette et al. [7] seems to apply.
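
For reference, the 2-dimensional maximality test has a simple worst-case O(log n) solution, sketched below in Python. It is not distribution-sensitive and is not the technique of Collette et al.; the names `build_maximality_index` and `is_maximal` are hypothetical. After sorting by x-coordinate and storing suffix maxima of the y-coordinates, a query is maximal exactly when no point with a larger x-coordinate also has a larger y-coordinate.

```python
from bisect import bisect_right

def build_maximality_index(points):
    """Sort points by x and record, for each position, the largest y at or after it."""
    pts = sorted(points)
    xs = [x for x, _ in pts]
    suffix_max_y = [0] * len(pts)
    best = float("-inf")
    for k in range(len(pts) - 1, -1, -1):
        best = max(best, pts[k][1])
        suffix_max_y[k] = best
    return xs, suffix_max_y

def is_maximal(index, q):
    """True if no indexed point is strictly larger than q in both coordinates."""
    xs, suffix_max_y = index
    k = bisect_right(xs, q[0])  # first position whose x is strictly greater than q's x
    return k == len(xs) or suffix_max_y[k] <= q[1]

index = build_maximality_index([(1, 5), (2, 3), (4, 4), (6, 1)])
print(is_maximal(index, (3, 4)), is_maximal(index, (3, 3)))
```

A distribution-sensitive structure would aim to beat this bound in expectation for query distributions of low entropy; the sketch only fixes the definition and a baseline.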

Open Problem 5. Are there distribution-sensitive data structures for $d$-sided range search in point sets
in $\mathbb{R}^d$? The current fastest structures for range search in point sets in $\mathbb{R}^d$ that use near-linear space have
$O(\log^{d-1} n)$ query time. Is there a structure that uses near-linear space and is optimal when the point set $S$
and the distribution $D$ are known in advance?